Welcome to A Primer on Quantile Regression for Epidemiologists! We are excited to have you here with us today! This document will serve as a companion to the slide deck which forms the basis of our discussion in today’s workshop.
Our workshop builds off of a paper we have written as an introduction to quantile regression methods for epidemiologists. This paper is entitled “Quantile regressions as a tool to evaluate how an exposure shifts and reshapes the outcome distribution: A primer for epidemiologists” and is available on American Journal of Epidemiology at this link. The paper was supported by a grant from the National Institute on Aging (R01 AG069092, PI Vable).
While no longer with our team, this workshop would not be possible without Aayush Khadka, first author of our didactic paper.
The learning aims of this workshop can be grouped into two categories: theory and practice. In terms of theory, our aim is to introduce you to quantile regression concepts by contrasting quantile regression with mean regression and, additionally, contrasting estimators targeted at quantiles of the conditional versus unconditional outcome distribution. In terms of practice, our goal is to provide you with code to conduct quantile regression analyses, to run various post-estimation tests, to present quantile regression coefficients, and to visualize how quantile regression results can be used to demonstrate the distributional effects of an exposure.
The quantile regression code in this workshop will be presented in the context of an analysis investigating the relationship between educational attainment and later-life systolic blood pressure (SBP). This is a question that occupies a lot of our time in VabLab because education has been shown to have a strong, inverse relationship with average SBP; however, since there may be a non-linear relationship between SBP and the risk of coronary heart disease and stroke, it is imperative that we understand education’s potential role in reshaping the SBP distribution.
Our data for the practical example comes from the US Health and Retirement Study, which is a nationally representative (when using their weights), multi-cohort, biennially conducted longitudinal study of non-institutionalized US adults who are 50 years or older. Our analytic sample includes US born HRS participants who were first interviewed in 1998 or later, and had no missing covariate information (N = 8,875). Since blood pressure started being measured in the HRS in 2006, our study period ranged from 2006 to 2018.
We use self-reported total years of schooling as our exposure variable and first recorded SBP as our outcome variable. Total years of schooling takes on integer values between 5 and 17 years, with 5 indicating that the respondent had 5 or fewer years of schooling and 17 indicating that the respondent had 17 or more years of schooling. We do not use repeated measures of SBP in our data to keep the exposition of quantile regression ideas clear and uncomplicated. In our analysis, we will also include covariates such as age (linear + quadratic term), sex (female and male persons), race (Black, Latinx, White), whether the respondent was born in the US South, parental education as a marker of childhood socioeconomic status, and the year in which SBP was measured.
#install.packages("tidyverse")
library(tidyverse) #Data cleaning/manipulation
#Load dataset
data <- read_rds("QRWorkshop.rds")
summary(data)
## id age age2 gender race
## Min. : 1 Min. :51.00 Min. :2601 Min. :1.000 Min. :1.000
## 1st Qu.:2220 1st Qu.:54.00 1st Qu.:2916 1st Qu.:1.000 1st Qu.:1.000
## Median :4438 Median :57.00 Median :3249 Median :1.000 Median :1.000
## Mean :4438 Mean :59.79 Mean :3645 Mean :1.468 Mean :1.335
## 3rd Qu.:6656 3rd Qu.:62.00 3rd Qu.:3844 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :8875 Max. :97.00 Max. :9409 Max. :2.000 Max. :3.000
## southern schlyrs mom_ed dad_ed
## Min. :0.0000 Min. : 5.00 Min. : 5.00 Min. : 5.00
## 1st Qu.:0.0000 1st Qu.:12.00 1st Qu.:10.00 1st Qu.: 8.00
## Median :0.0000 Median :14.00 Median :12.00 Median :12.00
## Mean :0.3781 Mean :13.75 Mean :11.33 Mean :10.86
## 3rd Qu.:1.0000 3rd Qu.:16.00 3rd Qu.:12.00 3rd Qu.:12.00
## Max. :1.0000 Max. :17.00 Max. :17.00 Max. :17.00
## year sbp
## Min. :2006 Min. : 73.0
## 1st Qu.:2008 1st Qu.:114.0
## Median :2010 Median :125.5
## Mean :2011 Mean :127.6
## 3rd Qu.:2012 3rd Qu.:138.5
## Max. :2018 Max. :233.0
#Format categorical variables
data$gender <- factor(data$gender)
data$race <- factor(data$race)
data$year <- factor(data$year)
#Centering school years at 12 years of schooling
data$schlyrs_c <- data$schlyrs - 12
The specifics of the variables included in our dataset are provided below:
All materials can be found on our repository. Please don’t hesitate to contact any of us with questions!
When we first entered the public health/epidemiology world, we were shocked by how many abbreviations there were. And now, we’ve reached the stage of our careers where we too are using a plethora of abbreviations in our daily lives. Unfortunately, this workshop too has fallen prey to our love for abbreviations. The following table lists all the abbreviations that you will see in this workshop and their full form. Please feel free to refer back to in whenever you want.
| Abbreviation | Full form | ||
|---|---|---|---|
| CDF | Cumulative Distribution Function | ||
| CEF | Conditional Expectation Function | ||
| CQF | Conditional Quantile Function | ||
| CQR | Conditional Quantile Regression | ||
| DGP | Data Generating Process | ||
| IF | Influence Function | ||
| OLS | Ordinary Least Squares | ||
| RIF | Recentered Influence Function | ||
| SBP | Systolic Blood Pressure | ||
| UQR | Unconditional Quantile Regression |
In his seminal work “Sick Individuals and Sick Populations”, Geoffrey Rose wrote about the need to reduce the incidence of disease through population-level interventions that shift the distribution of risk factors. Rose contrasted this approach with the more standard approach of prevention which focused on targeting those at high risk of disease and treating them, often with some form of medication. Additionally, Rose believed that quantifying the impact of population-level interventions on the distribution of risk factors could be done by quantifying how the intervention affects the mean of the risk factor distribution.
Rose’s ideas have influenced generations of epidemiologists, including us. Several epidemiologists have expanded his arguments to show that focusing simply on how the mean of the risk factor distribution changes may not give us full insight into how an intervention is affecting the entire distribution. In fact, these scholars have argued that we must explicitly test how a population-level intervention is affecting the tails of the risk factor distribution, especially as some of society’s most vulnerable people are located in the tails.
Blood pressure is a striking example to illustrate the need to focus on the tails of a distribution. Many studies have indicated that blood pressure may have a non-linear relationship with the risk of other poor health outcomes such as coronary heart disease or stroke. This means that a 20 unit decrease in blood pressure at high levels (say, from 180 mmHg to 160 mmHg) reduces the risk of poor health outcomes significantly more than a 20 unit decrease at low levels (say, from 140 mmHg to 120 mmHg). As such, a population-level intervention that is able to reduce blood pressure more at higher levels relative to lower levels may have a greater potential to improve population health than, say, an intervention which affects the blood pressure distribution uniformly.
Empirical epidemiology has, to date, largely focused on modeling how interventions affect the mean of an outcome. Unfortunately, focusing only on how an intervention affects the mean may not provide us with a good idea of how the same intervention is affecting the entire outcome distribution. Consider the following figure: we see that mean models are capable of capturing distributional effects only when an intervention uniformly affects the outcome distribution (“location shift”); however, when an intervention affects the variance of the distribution (“scale shift”) or both the location and scale of the outcome distribution, then mean models cannot capture distributional effects of an intervention.
#install.packages("gridExtra")
library(gridExtra) #Printing ggplots together
set.seed(2468) #Remove seed to create different distribution plots
#Location shift
cont <- rnorm(100, mean = 0, sd = 1)
loc_t <- cont + 3
ls <- data.frame("Outcome" = c(cont, loc_t),
"Group" = factor(c(rep("Control", length(cont)),
rep("Treatment", length(loc_t)))))
location_shift <- ggplot(aes(Outcome, color = Group, linetype = Group),
data = ls) +
geom_density() +
xlim(-10, 10) +
ylab("Density") +
scale_color_manual(values = c("#F2494C", "#468C8A")) +
theme_classic() +
theme(strip.background = element_blank(),
legend.position = "bottom")
#Scale shift
scl_t <- rnorm(100, mean = 0, sd = 3)
ss <- data.frame("Outcome" = c(cont, scl_t),
"Group" = factor(c(rep("Control", length(cont)),
rep("Treatment", length(scl_t)))))
scale_shift <- ggplot(aes(Outcome, color = Group, linetype = Group),
data = ss) +
geom_density() +
xlim(-10, 10) +
ylab("Density") +
scale_color_manual(values = c("#F2494C", "#468C8A")) +
theme_classic() +
theme(strip.background = element_blank(),
legend.position = "bottom")
#Location and Scale shift
ls_t <- scl_t + 3
lss <- data.frame("Outcome" = c(cont, ls_t),
"Group" = factor(c(rep("Control", length(cont)),
rep("Treatment", length(ls_t)))))
loc_scale_shift <- ggplot(aes(Outcome, color = Group, linetype = Group),
data = lss) +
geom_density() +
xlim(-10, 10) +
ylab("Density") +
scale_color_manual(values = c("#F2494C", "#468C8A")) +
theme_classic() +
theme(strip.background = element_blank(),
legend.position = "bottom")
#Final plot
grid.arrange(location_shift, scale_shift, loc_scale_shift,
ncol = 2, nrow = 2)
We argue that quantile regressions can offer us a way of understanding how an exposure affects the entire outcome distribution. In what follows we will demonstrate how quantile regressions allow us to investigate distributional effects. But first, we’ll set the stage up for quantile regression by taking a brief foray into the world of means, quantiles, and mean models, in particular, linear regression.
When we’re talking about means, we’re focusing on the arithmetic mean. We are probably all familiar with the idea of a mean or average of a random variable. Essentially, the mean represents the “center of mass” of a variable’s distribution.
We’re probably also all aware that we can estimate the mean of a random variable by taking a probability weighted sum of all elements of a random variable (e.g., in the case of a continuous variable, we would integrate the product of each element of the variable and the probability of its occurrence). Or, more simply, we could add up all the elements of a variable and divide the sum by the total number of elements (something that’s hard to do when we have a truly continuous variable with many elements).
However, there is also another approach to finding the mean of a random variable. This approach relies on solving a minimization problem. Specifically, we can find the mean of a random variable \(Y\) by minimizing:
\[ min_a \sum_{i=1}^{N}(y_i - a)^2 \]
Here, \(a\) is any value from the range of \(Y\). What this minimization problem says is that the value of \(Y\), denoted here by \(a\), which minimizes the sum of the squared difference between itself and all other values of \(Y\) is the mean of \(Y\). In other words, the mean minimizes the sum of squared differences.
What’s interesting is that we can extend this optimization framework to finding means of the conditional distribution of \(Y\). For instance, let \(X\) be the conditioning variable. Then, we can find the mean of \(Y|X\) (\(Y\) conditional on \(X\)) by minimizing:
\[ min_\beta \sum_{i=1}^{N} (y_i - g(x_i, \beta))^2 \] Here, \(g(x_i, \beta)\) is an arbitrary function of the conditioning variable. This minimization is something that you may have already seen before, for in a very specific context. For instance, let’s assume that \(g(x_i, \beta)\) can be written as a linear in parameters equation: \(\beta_0 + \beta_1 x_i\). In this case, we can write the minimization problem as:
\[ min_{\beta_0, \beta_1} \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1x_i)^2 \] Notice, that this is the famous “minimizing the sum of squared errors” for a least squares regression! In fact, the whole reason for showing you all the optimization approach to finding means (and later, quantiles) is that this optimization serves as the basis for linear regression (and later, quantile regression)!
One final point about means: there’s a very nice link between the means of the conditional distribution and the mean of the unconditional distribution that’s captured by the Law of Iterated Expectations. The Law of Iterate Expectation sounds fancy, but the underlying idea is something that we might already be familiar with. Basically, the Law of Iterated Expectation says that the unconditional mean of a variable (\(E[Y]\)) can be found by taking a probability weighted sum of all the conditional means (\(E[Y|X]\)) (i.e., \(E[Y] = E_x[E[Y|X]]\)). Let’s illustrate this in the context of SBP in our data where the conditioning variable is age group (<60 years, 60-69 years, 70-79 years, and 80+ years).
#Unconditional mean
mean(data$sbp) #127.63
## [1] 127.6303
#Conditional means
data <- data %>%
mutate(age_group = case_when(age < 60 ~ "<60",
between(age, 60, 69) ~ "60-69",
between(age, 70, 79) ~ "70-79",
age >= 80 ~ "80+"))
data %>%
group_by(age_group) %>%
summarise(mean_sbp = mean(sbp),
n = n(),
product = mean_sbp*n)
## # A tibble: 4 × 4
## age_group mean_sbp n product
## <chr> <dbl> <int> <dbl>
## 1 60-69 130. 1925 251048.
## 2 70-79 134. 613 82178.
## 3 80+ 135. 524 70994.
## 4 <60 125. 5813 728498.
#Law of Iterated expectation in action. Here we're using information from the table above to multiply each conditional mean by the probability of that age group occurring in the data. The probability is calculated inside the parentheses.
125.3223*(5813/8875) + 130.4148*(1925/8875) + 134.0595*(613/8875) + 135.4838*(524/8875)
## [1] 127.6303
We will see the Law of Iterated Expectation show its full glory when we’re talking about unconditional quantile regressions later during this workshop. Additionally, it is the Law of Iterated Expectations that also allows us to interpret results from a plain vanilla linear regression model as the effect of an exposure on the unconditional mean of the outcome.
When we say quantiles, think “percentiles”. We’ll continue to use the term quantiles in this workshop since we’re talking about quantile regressions, although sometimes we’ve heard quantile regressions be called “percentile regressions” as well. We would advise you to stick with the term quantile regression as it is more standard and avoids confusion.
For a given random variable \(Y\), the \(\tau^{th}\) quantile is that value of \(Y\) such that \(\tau\)% of the values lie below it. For example, the \(50^{th}\) quantile of \(Y\) is the value of \(Y\) such that 50% of observations lie below it. Similarly, the \(25^{th}\) and \(75^{th}\) quantiles of \(Y\) are those values of \(Y\) such that 25% and 75% of observations lie below them. The \(50^{th}\) quantile is better known by the term “median”. Sometimes, the \(50^{th}\), \(25^{th}\), and \(75^{th}\) quantiles may also be called the 0.50, 0.25, and 0.75 quantiles. Finally, please note that quantiles aren’t restricted only to the \(50^{th}\), \(25^{th}\), or \(75^{th}\) quantiles. \(\tau\) could take any value between 0 and 1, so you can define any quantile between 0 and 1: 0.10 quantile, 0.18 quantile, 0.36 quantile, 0.45 quantile, 0.90 quantile…the list is actually infinite!
It’s easy to find quantiles of a random variable in R. Let’s take a look at how we’d find quantiles of the unconditional SBP distribution:
as.numeric(quantile(data$sbp, probs = c(0.10, 0.25, 0.50, 0.75, 0.90)))
## [1] 104.5 114.0 125.5 138.5 153.5
q10 <- as.numeric(quantile(data$sbp, probs = 0.10))
q25 <- as.numeric(quantile(data$sbp, probs = 0.25))
q50 <- as.numeric(quantile(data$sbp, probs = 0.50))
q75 <- as.numeric(quantile(data$sbp, probs = 0.75))
q90 <- as.numeric(quantile(data$sbp, probs = 0.90))
Quantiles have an interesting parallel to means, in the sense that we can estimate quantiles by solving an optimization problem too. For a random variable \(Y\), we can find its \(\tau^{th}\) quantile by minimizing:
\[ min_a \Sigma_{i=1}^{N} \rho_\tau(y_i - a). \]
This minimization looks similar to the minimization we did for finding means because both involve minimizing a sum. In both cases, \(a\) refers to a value of \(Y\) from the range of \(Y\). However, what’s inside the sum is different: in the case of means, we were minimizing the sum of squared deviations from \(a\) while here we’re minimizing the deviation of all values of \(Y\) from \(a\) but these deviations are plugged into the \(\rho_\tau\) function.
It’s worth interrogating the \(\rho_\tau\) function a bit more since it shows up again in conditional quantile regressions. The function \(\rho_\tau\) is defined as follows:
\[ \rho_\tau(u) = u(\tau - I(u<0)) \]
where \(\tau\) refers to the quantile of interest, \(u\) is the argument of the function, and \(I(u<0)\) is the indicator function which takes the value 1 when \(u<0\) and 0 otherwise. We use \(u\) to make the expression of \(\rho_\tau\) convenient; however, you can replace \(u\) with \(y_i - a\) to express \(\rho_\tau\) in a way that conforms to the minimization we saw above:
\[ \rho_\tau(y_i - a) = (y_i - a) (\tau - I(y_i - a < 0)). \] The \(\rho_\tau\) function is also known as the “check function” because of how it sketches out a check mark when we plot its points. In our presentation, we show the check mark plot in the case of our data with \(a = 100\) and \(a = 120\) and SBP being the random variable of interest.
Just as with means, we can find quantiles of the conditional distribution of a random variable by solving an optimization problem as well.
For instance, let \(X\) be the conditioning variable for a random variable \(Y\). Then, we can find the \(\tau^{th}\) quantile of \(Y|X\) (\(Y\) conditional on \(X\)) by minimizing:
\[ min_\beta \sum_{i=1}^{N} \rho_\tau(y_i - g(x_i, \beta)). \] Here, \(g(x_i, \beta)\) is an arbitrary function of the conditioning variable. We saw how in the case of means, the minimization to find the conditional means serves as the foundation for linear regression. The same is true for conditional quantile regressions as well! For instance, let’s assume that \(g(x_i, \beta)\) can be written as a linear in parameters equation: \(\beta_0 + \beta_1 x_i\). In this case, we can write the minimization problem as:
\[ min_{\beta_0, \beta_1} \sum_{i=1}^{N} \rho_\tau(y_i - \beta_0 - \beta_1x_i). \] This is exactly the optimization function we need to solve to estimate parameters of the conditional quantile regression! More on this later.
The figure below shows the 10th, 25th, 50th, 75th, and 90th quantiles of the unconditional SBP distribution (panel a) and SBP distribution among 80+ year olds in our data (panel b). Notice how the values of the quantiles between the unconditional and conditional distributions don’t line up particularly well. This is to be expected, of course, but this poses a challenge when it comes to working back up from conditional quantiles to unconditional quantiles.
#Unconditional density plot
unconditional_den <- ggplot(data, aes(x = sbp)) +
geom_density(aes(y = after_stat(density))) +
xlab("Systolic Blood Pressure (mmHg)") +
ylab("Density") +
labs(title = "a) Unconditional density of systolic blood pressure") +
theme_classic() +
theme(strip.background = element_blank(),
text = element_text(size = 10)) +
geom_vline(aes(xintercept = q10), color = "#F2494C", linewidth = 1) +
geom_vline(aes(xintercept = q25), color = "#FF9300", linewidth = 1) +
geom_vline(aes(xintercept = q50), color = "#468C8A", linewidth = 1) +
geom_vline(aes(xintercept = q75), color = "blue", linewidth = 1) +
geom_vline(aes(xintercept = q90), color = "purple", linewidth = 1) +
scale_x_continuous(breaks = seq(60, 240, by = 20))
#Conditional density plot (for 80+ group)
data_8 <- data %>%
dplyr::filter(age_group == "80+")
q10_8 <- as.numeric(quantile(data_8$sbp, probs = 0.10))
q25_8 <- as.numeric(quantile(data_8$sbp, probs = 0.25))
q50_8 <- as.numeric(quantile(data_8$sbp, probs = 0.50))
q75_8 <- as.numeric(quantile(data_8$sbp, probs = 0.75))
q90_8 <- as.numeric(quantile(data_8$sbp, probs = 0.90))
conditional_den <- ggplot(data_8, aes(x = sbp)) +
geom_density(aes(y = after_stat(density))) +
xlab("Systolic Blood Pressure (mmHg)") +
ylab("Density") +
labs(title = "b) Conditional density of systolic blood pressure (80+)") +
theme_classic() +
theme(strip.background = element_blank(),
text = element_text(size = 10)) +
geom_vline(aes(xintercept = q10_8), color = "#F2494C", linewidth = 1) +
geom_vline(aes(xintercept = q25_8), color = "#FF9300", linewidth = 1) +
geom_vline(aes(xintercept = q50_8), color = "#468C8A", linewidth = 1) +
geom_vline(aes(xintercept = q75_8), color = "blue", linewidth = 1) +
geom_vline(aes(xintercept = q90_8), color = "purple", linewidth = 1) +
scale_x_continuous(breaks = seq(60, 240, by = 20))
grid.arrange(unconditional_den, conditional_den, nrow = 2)
With means, we could rely on the Law of Iterated Expectation to help us back out the unconditional mean from the conditional means. Unlike means, there is no Law of Iterated Quantiles (yet), which means that we usually cannot use information about quantiles of the conditional distribution to learn about the same quantiles of the unconditional distribution. This in turn means that we need to be careful about choosing which quantile is of substantive interest before conducting a quantile regression analysis.
Choosing between conditional and unconditional quantiles really depends on what research question you are interested in investigating. Are you interested in making comparisons of the exposure-outcome relationship at different quantiles across groups based on individual characteristics? In such a case, you are likely interested in quantiles of the conditional outcome distribution, where the conditioning variables are those individual-level characteristics. On the other hand, if you are interested in understanding what would happen to the distribution of an outcome in the entire population under different levels of an exposure, then you may be interested in quantiles of the unconditional outcome distribution.
Some additional nuances: if you believe that your exposure of interest only leads to a location shift in the outcome distribution, then the conditional and unconditional quantiles coincide and you don’t need to choose between the two. In fact, in such a scenario, you really don’t need to do quantile regressions at all and mean models will be perfectly adequate in capturing distributional effects of an exposure. However, if you believe that the exposure: 1) affects the variance of the outcome distribution and 2) interacts with other covariates you include in your model, then the conditional and unconditional quantiles will not coincide and you will need to be careful about identifying which outcome distribution you are truly interested in.
Let’s say our objective is to quantify the association of educational attainment with systolic blood pressure (SBP). A potential solution could be to fit an unadjusted linear regression model by plotting a linear line through the data.
ggplot(aes(x = schlyrs, y = sbp), data = data) +
geom_point() +
scale_x_continuous(limits = c(5, 17),
breaks = seq(5, 17, by = 1)) +
xlab("School Years") +
ylab("Systolic Blood Pressure (mmHg)") +
geom_smooth(method = "lm", formula = y ~ x, color = "#FF9300") +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_text(size = 12),
axis.title = element_text(size = 13.5, face = "bold"),
plot.title = element_text(size = 16, face = "bold"))
ols_un <- lm(sbp ~ schlyrs_c, data = data)
summary(ols_un)$coefficients["schlyrs_c", ]
## Estimate Std. Error t value Pr(>|t|)
## -1.091731e+00 8.734720e-02 -1.249875e+01 1.509392e-35
But what exactly is linear regression modeling? The linear regression we fit in our sample is a model of the population regression line, which itself is a model of the conditional expectation function (CEF). The CEF is a function which describes how the conditional expectation of one variable changes as we change values of the conditioning variable. The CEF is usually written as \(E[Y|X]\) or in the context of our example, \(E[SPB|schlyrs]\). When the CEF is linear, i.e., when the conditional means of a variable can be connected by a straight line, then the population regression line and the CEF coincide. When the CEF is non-linear, the population regression line provides the best linear approximation to the non-linear CEF, where “best” is defined in terms of the line that minimizes the expected mean squared error.
#install.packages("gganimate")
#install.packages("gifski")
#install.packages("magick")
#Loading some libraries
library(gganimate) #Animation plots
library(gifski) #Graphics
library(magick) #Graphics
#Estimating the conditional means of SBP by level of schooling
#Reducing data
df_linreg_basic <- data %>%
select("schlyrs", "sbp", "gender") %>%
group_by(schlyrs) %>%
mutate(mean_sbp = mean(sbp, na.rm = TRUE)) %>%
ungroup()
#Creating a dataset for plotting
df_linreg_basic_animate <- rbind(
#Step 1: Raw data only
df_linreg_basic %>% mutate(mean_sbp = NA, time = '1. Full data'),
#Step 2: SBP means only
df_linreg_basic %>% mutate(sbp = mean_sbp,
mean_sbp = NA, time = '2. Conditional means')
)
#GIF1: Animating the conditional expectation
linreg_scatter <- ggplot(data = df_linreg_basic_animate, aes(x = schlyrs, y = sbp)) +
#Regular plot
geom_point() +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_text(size = 12),
axis.title = element_text(size = 13.5, face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
scale_x_continuous(limits = c(5, 17),
breaks = seq(5, 17, by = 1)) +
xlab("Total years of schooling") +
ylab("Systolic Blood Pressure (mmHg)") +
#Animation
transition_states(time, wrap = FALSE) +
ease_aes('linear') +
exit_fade() +
enter_fade()
animate(linreg_scatter,
width = 10, height = 6.67, unit = "in",
res = 300,
renderer = gifski_renderer(loop = FALSE),
fps = 20)
The conditional expectation function (CEF) tells us how the conditional mean of the outcome changes as we change values of the conditioning variable
The population regression line is represented by the equation \(\beta_0 + \beta_1x_i\) or \(\beta_0 + \beta_1schlyrs_i\) in the context of our example. The linear regression we fit in our samples represents our attempt to estimate parameters of this population regression line and each is written as \(\hat{\beta_0} + \hat{\beta_1}x_i\), where the \(\hat{}\) (“hat”) represents an estimated quantity. In the context of our example, the sample regression line is represented by the equation \(\hat{\beta_0} + \hat{\beta_1}schlyrs_i\).
#Creating a dataset of conditional means
df_sbp_means <- data %>%
group_by(schlyrs) %>%
summarise(mean_sbp = mean(sbp, na.rm = TRUE))
#CEF plot with regression line
cef_reg <- ggplot(data = df_sbp_means, aes(x = schlyrs, y = mean_sbp)) +
geom_point() +
geom_line() +
geom_smooth(data = df_linreg_basic, aes(x = schlyrs, y = sbp), method = "lm", formula = y ~ x, color = "#FF9300", se = FALSE) +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_text(size = 12),
axis.title = element_text(size = 13.5, face = "bold"),
plot.title = element_text(size = 16, face = "bold")) +
scale_x_continuous(limits = c(5, 17),
breaks = seq(5, 17, by = 1)) +
scale_y_continuous(limits = c(60, 240),
breaks = seq(60, 240, by = 40)) +
xlab("Total years of schooling") +
ylab("Systolic Blood Pressure (mmHg)")
print(cef_reg)
Since it is a model of the conditional expectation function, linear regression provides us with an estimate for how the mean of the outcome changes as we change the exposure by one unit
In the context of our example, linear regression answers the question: By how much does mean systolic blood pressure change for each additional year of schooling?
Imagine that our model of the world (i.e., the data generating process) can be represented with the equation \(y_i = \beta_0 + \beta_1x_i + \epsilon_i\), where \(\epsilon\) represents the error term. This equation can be rewritten to solve for \(\epsilon\), where \(\epsilon_i = y_i - \beta_0 - \beta_1x_i\). The population regression line is \(E[Y|X] = \beta_0 + \beta_1x_i\), where we assume that \(E[\epsilon|X] = E[\epsilon] = 0\) (a.k.a, the zero conditional mean assumption) and \(Var(\epsilon) = \sigma^2\) (a.k.a, the constant variance assumption or homoskedasticity).
We can calculate the coefficients of the population regression line by minimizing the expectation of the squared error term, which we can think of as minimizing the sum of squared errors: \((\beta_0,\beta_1) = min\sum_{i}{\epsilon_i^2} = min_{\beta_0,\beta_1}\sum_{i}(y_i - \beta_0 - \beta_1x_i)^2\). As mentioned earlier, linear regression in our sample is trying to estimate parameters of this population regression line. There are several ways to estimate coefficients of our sample linear regression (e.g., ordinary least squares, moment estimators, or maximum likelihood estimation). We will focus on Ordinary Least Squares (OLS) because it’s easy, familiar, and will come in handy later when we’re talking about unconditional quantile regressions.
OLS mimics the process of finding the population regression coefficients by minimizing the estimate of the sum of squared errors in the sample: \((\hat{\beta_0},\hat{\beta_1}) = min\sum_{i}{\hat{\epsilon_i}^2} = min_{\hat{\beta_0},\hat{\beta_1}}\sum_{i}(y_i - \hat{\beta_0} - \hat{\beta_1}x_i)^2\). This minimization has an analytic solution: \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\) and \(\hat{\beta_1} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}\). \(\hat{\beta_0}\) is the estimated average value of \(Y\) when \(x_i = 0\) (i.e., the mean of the distribution \(Y|x_i = 0\)). \(\hat{\beta_1}\) is the estimated change in the conditional mean of \(Y\) for a one unit change in \(X\).
OLS coefficients can be written in other ways too: for example, in matrix notation, we can write the linear regression model as \(\widehat{E[Y|X]} = X'\hat{\beta}\) and the solution as \(min_{\beta}\sum_{i=1}^{N}(Y - X'\hat{\beta})^2 \rightarrow \hat{\beta} = (X'X)^{-1}(X'Y)\). Here \(\hat{\beta}\) is the vector of coefficients, \(X\) is the design matrix which includes all independent variables in the model as well as a column of 1’s for the intercept, \(X'\) is the transpose of the matrix \(X\) (i.e., rows and columns switched), and \(Y\) is the vector of the outcome.
The matrix notation is particularly useful when we have more than one covariate in our regression model. What’s nice about linear regression is that regardless of how many covariates you include the interpretation of the \(\beta_1\) coefficient remains basically the same. For example, suppose we were trying to estimate parameters of the model: \(E[Y|X] = \beta_0 + \beta_1x_{1,i} + \beta_2x_{2,i}\). We could interpret our estimate for \(\beta_1\) (i.e., \(\hat{\beta_1}\)) as the estimated change in the mean value of \(Y\) due to a one unit change in \(X_1\) holding \(X_2\) constant. We fit a multivariable linear regression model below and provide an interpretation on the “schlyrs_c” coefficient:
#Fit adjusted ols model
ols <- lm(sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed +
dad_ed + year, data = data)
summary(ols)$coefficients["schlyrs_c", ]
## Estimate Std. Error t value Pr(>|t|)
## -7.905091e-01 9.296671e-02 -8.503141e+00 2.145557e-17
We can interpret coefficient on the “schlyrs_c” as: a one year increase in years of schooling is associated with a 7.9 mmHg decrease in average SBP, holding all other covariates in the model constant.
The estimated coefficients in the linear regression model are random variables themselves and have a distribution, which we call the “sampling distribution”. The coefficients are random variables in the sense that they have been fit on a sample, and fitting the sample regression (as opposed to a population regression) on different samples from the same population would produce (potentially) different values of the coefficients. Under some assumptions, the central limit theorem tells us that the sampling distribution is normally distributed with the mean equal to the population coefficient value. We use standard errors to understand how close our estimate is to the mean (i.e., standard deviation of the sampling distribution).
When the error term is homoskedastic, the standard error is calculated using \(se(\hat{\beta})=\sigma^2(X'X)^{-1}\). However, this assumption is almost always violated in practice. One way the assumption is violated is that the error variance is a function of the covariates of the model, a situation known as heteroskedasticity. In this situation, the errors are no longer assumed to be identically distributed (but still assumed to be independent). Heteroskedasticity robust standard errors can be estimated in our sample linear regression coefficient using \(se(\hat{\beta})=(X'X)^{-1}X' \Omega X(X'X)^{-1}\) where \(\Omega\) is the variance-covariance matrix
\[ \Omega = \left[\begin{array}{cccc} \sigma_1^2 & 0 & ... & 0 \\ 0 & \sigma_2^2 & ... & 0 \\ \vdots & 0 & \ddots & \vdots \\ 0 & ... & ... & \sigma_n^2 \end{array}\right] = \left[\begin{array}{c} \epsilon_1^2 & 0 & ... & 0 \\ 0 & \epsilon_2^2 & ... & 0 \\ \vdots & 0 & \ddots & \vdots \\ 0 & ... & ... & \epsilon_n^2 \end{array}\right] = \left[\begin{array}{c} (y_1-\hat{y_1})^2 & 0 & ... & 0 \\ 0 & (y_2-\hat{y_2})^2 & ... & 0 \\ \vdots & 0 & \ddots & \vdots \\ 0 & ... & ... & (y_n-\hat{y_n})^2 \end{array}\right] \]
#install.packages("lmtest")
#install.packages("sandwich")
library(lmtest) #coeftest for robust standard errors
library(sandwich) #vcovHC for robust standard errors
#Calculate robust standard errors (robust t-test)
ols_robust <- coeftest(ols, vcov = vcovHC(ols, type = 'HC0'))
ols_robust["schlyrs_c", ]
## Estimate Std. Error t value Pr(>|t|)
## -7.905091e-01 9.451984e-02 -8.363420e+00 7.016933e-17
#Compare to unadjusted standard errors
summary(ols)$coefficients["schlyrs_c", ]
## Estimate Std. Error t value Pr(>|t|)
## -7.905091e-01 9.296671e-02 -8.503141e+00 2.145557e-17
Notice that our robust standard error estimate for the coefficient on “schlyrs_c” is slightly different from the “non-robust” standard error estimate we saw earlier.
#install.packages("boot")
#install.packages("simpleboot")
library(boot) #Bootstrapping
library(simpleboot) #Bootstrapping
ols_func <- function(data){
set.seed(123)
ols_mod <- lm(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data)
summary(ols_mod)
#Bootstrap standard error
ols_boot <- lm.boot(ols_mod, R = 500)
boot_sum <- summary(ols_boot)
ols_schlyrs_est <- as.numeric(boot_sum$orig.lm$coefficients["schlyrs_c"])
ols_schlyrs_sd <- as.numeric(boot_sum$stdev.params["schlyrs_c"])
ols_row <- data.frame("Quantile" = -0.1,
"Estimate" = ols_schlyrs_est,
"Lower" = ols_schlyrs_est - (1.96*ols_schlyrs_sd),
"Upper" = ols_schlyrs_est + (1.96*ols_schlyrs_sd),
"regtype" = "OLS")
return(ols_row)
}
ols_results <- ols_func(data)
ols_results[,"Estimate"]
## [1] -0.7905091
#Compare to robust standard errors
#ols_robust["schlyrs_c", ]
#Compare to unadjusted standard errors
#summary(ols)$coefficients["schlyrs_c", ]
Just as the CEF describes how the mean of a variable changes as we change values of the conditioning variable(s), the Conditional Quantile Function (CQF) describes how quantiles of a variable change as we change values of the conditioning variable. For example, in our data, the CQF for SBP at the 50th quantile conditional on years of schooling describes how the median of SBP changes as we go from those who have 5 years of schooling to 6 years of schooling to 7 years of schooling and so forth. Similarly, the CQF for SBP at the 75th quantile conditional on years of schooling describes how the 75th quantile of SBP changes as we go from 5 to 6 all the way to 17 years of schooling.
We read earlier that linear regression models the CEF. In the same way, conditional quantile regression (CQR) models the CQF at the quantile of interest. So, the population CQR is a linear model for the potentially non-linear CQF just as the population linear regression is a linear model for the potentially non-linear CEF. It follows then that the CQR we fit in our sample is an approximation of the population CQR at the quantile of interest, just as the linear regression we fit in our sample is an approximation of the population linear regression.
Just as linear regression models the conditional expectation function, conditional quantile regression models the conditional quantile function
CQR allows us to understand how an exposure affects the outcome distribution by comparing estimates of the exposure-outcome relationship across multiple quantiles and with respect to the conditional mean. For example, consider a scenario in which the CEF is flat (i.e., slope = 0) and the CQF at the 75th quantile has a negative slope. This suggests that as we move to higher values of the conditioning variable, the right tail of the conditional outcome distribution gets pulled closer and closer to the mean, which suggests a reshaping of the outcome distribution.
How do we go about estimating CQR coefficients? Earlier, we saw that estimating quantiles of the conditional distribution of a random variable required minimizing the following:
\[ min_\beta \sum_{i=1}^{N} \rho_\tau(y_i - g(x_i, \beta)). \]
The most basic CQR model replaces \(g(x_i, \beta)\) with a linear in parameters model such as \(\beta_0 + \beta_1x_i\). This leads to the minimization:
\[ min_{\beta_0, \beta_1} \sum_{i=1}^{N} \rho_\tau(y_i - \beta_0 - \beta_1x_i). \] More elaborate CQR models may include non-linear functions of the parameters, but we will not focus on those in our workshop. Rather’s let’s talk about how we can estimate the coefficients of our linear in parameters CQR model. We learned in the linear regression review that the OLS estimator for linear regression has an analytic solution. This made our life terribly easy. However, we don’t have such a luxury in the CQR world: in the minimization above, there is no closed form solution to find \(\beta_0\) and \(\beta_1\). It follows that there is no closed form solution to a linear in parameters CQR model with more than one independent variable.
As many have asked: what is to be done? Is all hope lost? Not at all! Roger Koenker and Gilbert Bassett Jr., in their 1978 paper introducing CQR, showed us that we can estimate coefficients of the CQR model described above by using linear programming methods to solve the minimization problem. Linear programming methods are numerical ways of solving optimization problems subject to certain constraints (i.e., they are algorithms that “guess” the solution to an equation and keep modifying their guess to get to the best guess of the solution). We will not go into the details of how we can re-express the minimization problem described above as a linear programming problem – it is rather hairy and honestly not particularly instructive. We will say that there are numerous linear programming optimization methods out there to solve the CQR minimization problem. In most cases, however, the default linear programming choice provided in statistical packages for CQR should be good enough for fitting the model we want to fit.
In R, we fit CQR models by using the function “rq” from the package
quantreg
to run our conditional quantile regressions (4-6). The structure of the command mirrors the
structure of the command to fit linear regression using “lm()”.
Specifically, we can fit CQR models in R as:
rq(formula, tau, data, method, …)
In the syntax above, formula takes on the structure “y ~ x”, tau refers to the quantile of interest, data refers to the dataset being used, and method refers to the optimization method you want to use to solve the linear programming problem to estimate model parameters. The default method is known as the Barrodale and Roberts algorithm (method = “br”). As the authors of the “quantreg” package suggest, the default optimization method should be perfectly fine even if you have a relatively large dataset with several thousand observations.
Let’s begin by fitting a model of SBP on years of schooling at the median without any additional control variables and think about how we can interpret the estimate.
#install.packages("quantreg")
library(quantreg) #Conditional quantile regression
#50th quantile
un_cqr_50 <- rq(sbp ~ schlyrs_c, data = data, tau = 0.5)
summary(un_cqr_50)
##
## Call: rq(formula = sbp ~ schlyrs_c, tau = 0.5, data = data)
##
## tau: [1] 0.5
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 127.00000 0.28031 453.07029 0.00000
## schlyrs_c -0.90000 0.09484 -9.48945 0.00000
The CQR at the median produces an output that looks very similar to our linear regression output. Notice that we have an intercept term and a coefficient on the “schlyrs_c” variable. The intercept is our estimate for median SBP among those who have 12 years of schooling (recall that we have centered school years at 12). The coefficient on “schlyrs_c” can be interpreted as the association of an additional year of schooling with median SBP. So, our results suggest that median SBP among those with 12 years of schooling is 127 mmHg and that each additional year of schooling from 12 years is associated with a 0.9 mmHg reduction in median SBP.
Let’s now fit a CQR at the median adjusting for covariates:
cqr_50 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data, tau = 0.5)
summary(cqr_50)
##
## Call: rq(formula = sbp ~ schlyrs_c + age + age2 + gender + race + southern +
## mom_ed + dad_ed + year, tau = 0.5, data = data)
##
## tau: [1] 0.5
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 61.93863 13.66039 4.53418 0.00001
## schlyrs_c -0.71897 0.10217 -7.03679 0.00000
## age 1.50670 0.42296 3.56223 0.00037
## age2 -0.00834 0.00323 -2.57776 0.00996
## gender2 7.82589 0.42336 18.48530 0.00000
## race2 5.72343 0.62719 9.12558 0.00000
## race3 2.23193 0.82229 2.71429 0.00665
## southern 1.47659 0.47213 3.12750 0.00177
## mom_ed 0.00139 0.09712 0.01431 0.98858
## dad_ed -0.05735 0.08075 -0.71019 0.47761
## year2008 0.42196 0.65757 0.64169 0.52109
## year2010 1.22034 0.70360 1.73443 0.08288
## year2012 -0.24101 0.66995 -0.35974 0.71905
## year2014 -2.72761 0.94059 -2.89991 0.00374
## year2016 0.57579 0.82739 0.69591 0.48650
## year2018 -1.19128 0.95305 -1.24996 0.21135
Notice that our adjusted model also produces an output very similar to the adjusted linear regression output. The interpretation of these estimates is also very similar. For instance, the intercept refers to the median SBP among folks with 12 years of schooling, holding all other covariates at 0. The coefficient on “schlyrs_c” is our estimate for the change in median SBP for each additional year of schooling, holding all other covariates constant. Based on our results, a one year increase in total years of schooling from 12 years is associated with a 0.72 mmHg decrease in median SBP, holding all else constant.
Let’s fit another adjusted model at the 25th quantile just so we get a bit of practice interpreting.
cqr_25 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data, tau = 0.25)
summary(cqr_25)
##
## Call: rq(formula = sbp ~ schlyrs_c + age + age2 + gender + race + southern +
## mom_ed + dad_ed + year, tau = 0.25, data = data)
##
## tau: [1] 0.25
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 45.83193 12.64847 3.62352 0.00029
## schlyrs_c -0.63744 0.09497 -6.71210 0.00000
## age 1.75066 0.38690 4.52488 0.00001
## age2 -0.01093 0.00292 -3.74937 0.00018
## gender2 8.75949 0.40369 21.69874 0.00000
## race2 3.03862 0.67453 4.50482 0.00001
## race3 1.32838 0.75699 1.75483 0.07932
## southern 1.42459 0.44549 3.19784 0.00139
## mom_ed -0.04106 0.09420 -0.43592 0.66291
## dad_ed -0.00892 0.07782 -0.11459 0.90877
## year2008 1.33398 0.60955 2.18847 0.02866
## year2010 0.82936 0.78946 1.05054 0.29350
## year2012 0.32310 0.64346 0.50213 0.61559
## year2014 -1.73780 1.19992 -1.44826 0.14758
## year2016 1.99214 0.73642 2.70516 0.00684
## year2018 -1.21576 0.72686 -1.67261 0.09444
The intercept term in this model is our estimate for the 25th quantile of SBP for folks with 12 years of schooling, holding all other covariates at 0. The coefficient on “schlyrs_c” is our estimate for the change in the 25th quantile of SBP for each additional year of schooling, holding all other covariates constant. Based on our results, a one year increase in total years of schooling from 12 years is associated with a 0.64 mmHg decrease in the 25th SBP quantile, holding all else constant.
But, coefficients are only part of the story. In any regression analysis, we care deeply about the standard errors too! The default standard error estimation method is the so called “rank inversion” method described in a 2005 paper by Roger Koenker. This default option assumes that the error term is independent and identically distributed, which is usually not a plausible assumption.
The recommended method of estimating standard errors is to bootstrap them. We can use built-in commands with the “quantreg” package to estimate bootstrap standard errors. In the example below, we show one way of bootstrapping. In this case, the default bootstrap method is used, which involves randomly selecting observations from the data with replacement. In this case, we have specified 50 resamples.
# cqr_50 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
# mom_ed + dad_ed + year, data = data, tau = 0.5)
summary.rq(cqr_50, se = "boot", R = 50)
##
## Call: rq(formula = sbp ~ schlyrs_c + age + age2 + gender + race + southern +
## mom_ed + dad_ed + year, tau = 0.5, data = data)
##
## tau: [1] 0.5
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 61.93863 14.16978 4.37118 0.00001
## schlyrs_c -0.71897 0.08214 -8.75330 0.00000
## age 1.50670 0.44697 3.37092 0.00075
## age2 -0.00834 0.00340 -2.44977 0.01431
## gender2 7.82589 0.43806 17.86499 0.00000
## race2 5.72343 0.58328 9.81257 0.00000
## race3 2.23193 0.77983 2.86206 0.00422
## southern 1.47659 0.44477 3.31989 0.00090
## mom_ed 0.00139 0.11517 0.01207 0.99037
## dad_ed -0.05735 0.09028 -0.63525 0.52528
## year2008 0.42196 0.60366 0.69901 0.48457
## year2010 1.22034 0.76627 1.59258 0.11129
## year2012 -0.24101 0.61130 -0.39425 0.69340
## year2014 -2.72761 1.07480 -2.53779 0.01117
## year2016 0.57579 0.86830 0.66313 0.50727
## year2018 -1.19128 0.94107 -1.26587 0.20559
Other bootstrap methods are available within the “quantreg” package. These methods include the Parzen, Wei and Ying (1994) method, He and Hu’s (2002) Markov chain bootstrap method, and the Kocherginsky et. al. (2003) bootstrap method. There is also a wild bootstrap method due to Feng et. al. (2011) available. Which method is preferable likely depends on the intricacies of the analysis you are conducting. In general, the default method seems fine for simple, cross-sectional data structures.
The standard non-parametric bootstrap estimator used for CQR approximates the “heteroskedasticity” robust standard error estimator for CQR. We are putting heteroskedasticity in quotes because we have not seen this terminology used as much in the quantile regression world. Rather, the assumption is that the density of the error term at the \(\tau^{th}\) quantile of the conditional error distribution does not equal the density of the error term at the \(\tau^{th}\) quantile of the unconditional error distribution: this is somewhat similar in flavor to the heteroskedasticity assumption, but there are subtle differences. In any case, estimating robust standard errors in the CQR framework relies on estimating the inverse of the error density at the \(\tau^{th}\) quantile. Since the error density will be sparse where the outcome is sparse, and since the outcome is usually sparse at the tails, a key criticism of CQR is that statistical power is usually a problem at the tails of a distribution, which are generally of substantive interest.
Key criticism of conditional quantile regression is that statistical power is usually a problem at the tails of the outcome distribution, which is substantively of interest
Usually, we want to fit CQR models at multiple quantiles of the conditional outcome distribution to assess how an exposure affects the outcome distribution. But, fitting so many regressions poses a challenge in terms of displaying the results. Imagine that we fit a CQR model for each quantile from the 1st to 99th quantiles. Do we really want to have a table where we show the coefficient estimates at each quantile? That would look a bit ugly and might turn off our readers.
One way that we have seen CQR results shown in the literature is the following. We fit the CQR model at each quantile from, say, the 1st to 99th quantiles. We estimate bootstrap standard errors or, at the very least, robust standard errors. We then get estimates for the 95% confidence interval on the point estimate for our exposure. Now, instead of showing each coefficient and confidence interval as a separate row in a table, we will plot them using the built in plot function:
#Fitting CQR models from the 1st to 99th quantiles
cqr_all <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data,
tau = seq(0.01, 0.99, by = 0.01)) #Model all quantiles
cqr_plot <- summary(cqr_all, se = "boot", R = 50, confint = TRUE)
#Plotting coefficients
plot(cqr_plot)
This is, however, not our favorite way to plot CQR results. We think that rather than plotting all the coefficients, it is perhaps most effective to plot coefficients for the exposure variable only. Additionally, the built in “plot” function is not as flexible as “ggplot2”, which we love. So, in order to focus the presentation of results on only the exposure variable and to use ggplot while we’re at it, we propose the following method of plotting CQR results:
cqr_func <- function(data){
result <- data.frame()
for(i in seq(0.01, 0.99, by = 0.01)){
i <- round(i, 2)
set.seed(123)
cqr_mod <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data, tau = i)
#Bootstrap standard error
cqr_boot <- summary.rq(cqr_mod, se = "boot", R = 50)
cqr_schlyrs_est <- as.numeric(cqr_boot$coefficients["schlyrs_c", "Value"])
cqr_schlyrs_sd <- as.numeric(cqr_boot$coefficients["schlyrs_c",
"Std. Error"])
cqr_row <- data.frame("Quantile" = i,
"Estimate" = cqr_schlyrs_est,
"Lower" = cqr_schlyrs_est - (1.96*cqr_schlyrs_sd),
"Upper" = cqr_schlyrs_est + (1.96*cqr_schlyrs_sd),
"regtype" = "CQR")
result <- rbind(result, cqr_row)
}
return(result)
}
cqr_results <- cqr_func(data)
##Warning: Solution may be nonunique (Produced because of categorical variables in the model)
#Plotting results
#Axis labels
yaxis <- expression(Delta~SBP~(mmHg))
xaxis <- ""
#Figure
ggplot(data = cqr_results, aes(x = Quantile, y = Estimate)) +
geom_line(color = "#468c8a", size = 1) +
geom_ribbon(aes(x = Quantile, ymin = Lower, ymax = Upper),
alpha = 0.45, fill = "#468c8a") +
geom_hline(yintercept = 0, alpha = 0.5) +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(x = xaxis,
y = yaxis,
title = "Association of educational attainment with systolic blood pressure") +
scale_x_continuous(breaks = c(0.01, seq(0.10, 0.90, 0.10), 0.99),
labels = c("q1", "q10", "q20", "q30", "q40", "q50",
"q60", "q70", "q80", "q90", "q99"))
This figure is just a slightly fancier version of the plot that is produced through the in-built plot functionality in the “quantreg” package. We think it looks nicer and could be a good way of displaying CQR results. Notice that educational attainment appears to have a stronger protective association with SBP at higher quantiles of SBP relative to lower quantiles. This suggests that as we go from one level of schooling to another higher level, the SBP distribution is becoming narrower and that the narrowness is due to the higher risk, right tail of the SBP distribution being pulled closer to the center. Notice also how the confidence intervals around the point estimates at higher quantiles are very large: this is an illustration of the idea we saw earlier about how CQR estimates are usually less precise at the tails because data at the tails are usually sparse.
We can also add our OLS estimate to the CQR plot so viewers can easily compare estimates at the mean to estimates at each quantile of the conditional distribution.
ols_cqr <- rbind(ols_results, cqr_results)
#Figure
ggplot(data = ols_cqr %>% filter(regtype != "OLS"),
aes(x = Quantile, y = Estimate, group = regtype,
color = regtype, fill = regtype)) +
geom_line(alpha = 50, linewidth = 0.75) +
geom_ribbon(aes(ymin = Lower, ymax = Upper), alpha = 0.27, color = NA) +
geom_point(data = ols_cqr %>% filter(regtype == "OLS")) +
geom_errorbar(data = ols_cqr %>% filter(regtype == "OLS"),
aes(ymin = Lower, ymax = Upper), width = 0.01) +
geom_hline(yintercept = 0, alpha = 0.5) +
geom_vline(xintercept = -0.04, color = "gray", linetype = "dashed") +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
text = element_text(size = 10)) +
scale_color_manual(breaks = c("OLS", "CQR"),
labels = c("OLS", "CQR"),
values = c("#F2494C", "#468c8a")) +
scale_fill_manual(breaks = c("OLS", "CQR"),
labels = c("OLS", "CQR"),
values = c("#F2494C", "#468c8a")) +
scale_x_continuous(limits = c(-0.11, 1),
breaks = c(-0.1, 0.01, seq(0.10, 0.90, 0.10), 0.99),
labels = c("OLS", "q1", "q10", "q20", "q30", "q40", "q50",
"q60", "q70", "q80", "q90", "q99")) +
labs(x = "",
y = yaxis,
color = "Model", group = "Model", fill = "Model") +
theme(legend.position = "bottom",
legend.box.spacing = unit(0, "pt"),
legend.margin = margin(-10,0,0,0),
text = element_text(size = 15))
While the figure above allows us to imagine what sort of effect educational attainment is having on SBP, it doesn’t show us explicitly what’s happening. Wouldn’t it be nice to see, for instance, a density plot at a reference level of educational attainment and then another plot at, say, a higher level of educational attainment and compare those two density plots? Now we’re going to create something like that. Please note that this plot has numerous assumptions built into it, which we will discuss in greater detail below. We also want to acknowledge that this plot was inspired by what we saw in Bind et. al. (2015) and Bind et. al. (2017), and we thank Dr. Bind for the generous insights she provided into how she created the plot.
We are going to create this plot as follows: first, we are going to restrict our data to those who have 12 years of schooling. Then, we are going to determine what quantile each individual with 12 years of schooling lies in from the 1st to 99th quantile of the SBP distribution. Following this, we are going to merge in the association of education with SBP at each quantile. Finally, we are going to predict the new value of SBP for each individual with 12 years of schooling had they had a standard deviation increase in years of schooling.
#install.packages("statar")
library(statar) #Binning data
#Subsetting those with 12 years of schooling (Note that we have centered our educational attainment variable at 0, which is why we are setting schlyrs_c == 0 here)
data_12years <- data %>%
dplyr::filter(schlyrs_c == 0) %>%
dplyr::select("sbp") #Keeping only the SBP variable to keep things neat and clean
#Determining the 1st-99th quantile of SBP for those with 12 years of schooling
data_12years$quant <- xtile(data_12years$sbp, n = 99)
#Creating a dataset of only our coefficient estimates
coef_estimates <- cqr_results %>%
dplyr::select("Quantile", "Estimate") %>%
rename("quant" = "Quantile",
"coef" = "Estimate") %>%
dplyr::mutate(quant = round(quant, 2) * 100) #Change from (0, 1) to (1, 100)
#Merging coefficient estimates with the SBP data for those with 12 years of schooling
data_12years$quant <- round(data_12years$quant, 0)
coef_estimates$quant <- round(coef_estimates$quant, 0)
data_12years <- merge(data_12years, coef_estimates, by = "quant")
#Creating a counterfactual SBP value
data_12years$counterfactual_sbp <- data_12years$sbp + data_12years$coef
#Long version of data
data_12years_long <- data_12years %>%
dplyr::select(-coef) %>%
pivot_longer(!quant, names_to = "model", values_to = "sbp") %>%
dplyr::mutate(model = case_when(model == "sbp" ~ "Factual",
model == "counterfactual_sbp" ~ "Counterfactual"))
data_12years_long$model <- factor(data_12years_long$model,
levels = c("Factual", "Counterfactual"))
ggplot(data = data_12years_long, aes(x = sbp, color = model,
linetype = model)) +
geom_density(aes(y = after_stat(density)), linewidth = 1.5) +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
strip.background = element_blank(),
text = element_text(size = 20)) +
labs(color = "Distribution",
linetype = "Distribution",
x = "SBP (mmHg)",
y = "Density") +
scale_color_manual(values = c("Factual" = "#468c8a",
"Counterfactual" = "#FF9300")) +
scale_linetype_manual(values = c("Factual" = "solid",
"Counterfactual" = "dashed"))
In the figure above, the solid teal density represents the empirical density of SBP for folks with 12 years of schooling in our data. The dotted tangerine density represents the SBP density we expect had these individuals experienced a standard deviation increase in years of schooling (~2.4 years). As we would expect based on our point estimates from the different quantile regression models, the right tail of the distribution is being pulled closer to the center in the counterfactual SBP distribution whereas the left tail is relatively unaffected. We think this is an interesting way of visualizing the distributional “effect” of education on SBP among the largest group of folks in our sample in terms of educational attainment.
But, as we said earlier, this figure makes some important assumptions which may or may not hold. The first assumption is that we assume the relationship of education with SBP is the same within unit quantile intervals. For instance, we assume that the relationship of education and SBP between the 10th and 11th quantiles is the same as the estimated relationship between education and SBP at the 10th quantile. This may be an okay assumption to make – certainly we don’t think it’s particularly devastating.
#Determining quantiles of the counterfactual SBP distribution
data_12years$quant_new <- xtile(data_12years$counterfactual_sbp, n = 99)
#Checking how many people change which quantile their SBP value falls into
data_12years$change <- data_12years$quant - data_12years$quant_new
table(data_12years$change) #No one changes quantiles
##
## 0
## 2496
Another assumption that we appear to effectively make is that of rank invariance. That is, individuals with 12 years of schooling maintain their rank in the SBP distribution even when their education increases by ~2.4 years. Notice how no one actually changes their quantile value per the table above. Those who work in quantile regressions usually find the rank invariance assumption a bit difficult to justify. We too think that rank invariance is a strong assumption here: we should probably expect a bit more changes in rank under different exposure levels due to randomness.
Fitting so many CQR results raises the question: are our estimates at, say, the 10th quantile really different from our estimates at the 90th quantile? This is a question we’ve had in our group a number of times too. We use the anova test in the “quantreg” package to test this hypothesis. For example, say we wanted to test if the 10th quantile estimate is statistically different from the 50th quantile estimate and if that’s different from the 90th quantile estimate. We may want to know this in order to understand if our data suggest that a location shift hypothesis best describes the exposure-outcome relationship or if the exposure indeed reshapes the outcome distribution. We can do this test as follows:
#Fitting CQR at the 10th, 50th, and 90th quantiles
cqr_10 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data, tau = 0.1)
cqr_50 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data, tau = 0.5)
cqr_90 <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data, tau = 0.9)
#Testing the difference in coefficients
anova(cqr_10, cqr_50)
## Quantile Regression Analysis of Deviance Table
##
## Model: sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed + dad_ed + year
## Joint Test of Equality of Slopes: tau in { 0.1 0.5 }
##
## Df Resid Df F value Pr(>F)
## 1 15 17735 5.3067 8.8e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(cqr_50, cqr_90)
## Quantile Regression Analysis of Deviance Table
##
## Model: sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed + dad_ed + year
## Joint Test of Equality of Slopes: tau in { 0.5 0.9 }
##
## Df Resid Df F value Pr(>F)
## 1 15 17735 4.8036 2.016e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(cqr_10, cqr_90)
## Quantile Regression Analysis of Deviance Table
##
## Model: sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed + dad_ed + year
## Joint Test of Equality of Slopes: tau in { 0.1 0.9 }
##
## Df Resid Df F value Pr(>F)
## 1 15 17735 10.43 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(cqr_10, cqr_50, cqr_90)
## Quantile Regression Analysis of Deviance Table
##
## Model: sbp ~ schlyrs_c + age + age2 + gender + race + southern + mom_ed + dad_ed + year
## Joint Test of Equality of Slopes: tau in { 0.1 0.5 0.9 }
##
## Df Resid Df F value Pr(>F)
## 1 30 26595 5.6505 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test above compared the coefficient on the years of schooling variable across the three models. Additionally, it compared the coefficient on all other covariates included in our model. Then, it gave us a p-value for the significance of all these coefficients being jointly identical. The test procedure is described in Koenker and Bassett (1982). The p-value above suggests that the coefficients are, in fact, statistically distinct from one another at the 5% level.
However, what we really wanted to test was whether the coefficient on years of schooling was identical across the three models and not whether all coefficients together were identical. We haven’t figured out a way to do this exactly, but one approach, which isn’t entirely satisfactory, is to fit the unadjusted CQR models and then compare them using “anova”.
#Fitting CQR at the 10th, 50th, and 90th quantiles, unadjusted models
un_cqr_10 <- rq(sbp ~ schlyrs_c, data = data, tau = 0.1)
un_cqr_50 <- rq(sbp ~ schlyrs_c, data = data, tau = 0.5)
un_cqr_90 <- rq(sbp ~ schlyrs_c, data = data, tau = 0.9)
#Testing the difference in coefficients
anova(un_cqr_10, un_cqr_50)
## Quantile Regression Analysis of Deviance Table
##
## Model: sbp ~ schlyrs_c
## Joint Test of Equality of Slopes: tau in { 0.1 0.5 }
##
## Df Resid Df F value Pr(>F)
## 1 1 17749 12.098 0.0005061 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(un_cqr_50, un_cqr_90)
## Quantile Regression Analysis of Deviance Table
##
## Model: sbp ~ schlyrs_c
## Joint Test of Equality of Slopes: tau in { 0.5 0.9 }
##
## Df Resid Df F value Pr(>F)
## 1 1 17749 25.532 4.393e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(un_cqr_10, un_cqr_90)
## Quantile Regression Analysis of Deviance Table
##
## Model: sbp ~ schlyrs_c
## Joint Test of Equality of Slopes: tau in { 0.1 0.9 }
##
## Df Resid Df F value Pr(>F)
## 1 1 17749 40.807 1.722e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(un_cqr_10, un_cqr_50, un_cqr_90)
## Quantile Regression Analysis of Deviance Table
##
## Model: sbp ~ schlyrs_c
## Joint Test of Equality of Slopes: tau in { 0.1 0.5 0.9 }
##
## Df Resid Df F value Pr(>F)
## 1 2 26623 20.54 1.22e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Below, we also provide a function that loops through and fits conditional quantile regressions at the 10th, 25th, 50th, 75th, and 90th quantiles of the conditional outcome distribution. This function can be modified to fit all quantiles from the 1st through the 99th, or to fit select quantiles of your choosing.
restricted_cqr_func <- function(data){
result <- data.frame()
for(i in c(0.10, 0.25, 0.50, 0.75, 0.90)){
i <- round(i, 2)
set.seed(123)
cqr_mod <- rq(sbp ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data, tau = i)
#Bootstrap standard error
cqr_boot <- summary.rq(cqr_mod, se = "boot", R = 50)
cqr_schlyrs_est <- as.numeric(cqr_boot$coefficients["schlyrs_c", "Value"])
cqr_schlyrs_sd <- as.numeric(cqr_boot$coefficients["schlyrs_c",
"Std. Error"])
cqr_row <- data.frame("Quantile" = i,
"Estimate" = cqr_schlyrs_est,
"Lower" = cqr_schlyrs_est - (1.96*cqr_schlyrs_sd),
"Upper" = cqr_schlyrs_est + (1.96*cqr_schlyrs_sd),
"regtype" = "CQR")
result <- rbind(result, cqr_row)
}
return(result)
}
#Run CQR function for subset of quantiles
restricted_cqr_results <- restricted_cqr_func(data)
## Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique
## Warning in rq.fit.br(x, y, tau = tau, ...): Solution may be nonunique
##Warning: Solution may be nonunique (Produced because of categorical variables in the model)
#Compare to all results
# restricted_cqr_results
# cqr_results %>%
# dplyr::filter(Quantile == 0.10 | Quantile == 0.25 | Quantile == 0.50 |
# Quantile == 0.75 | Quantile == 0.90)
What would we do if we wanted to learn how the unconditional outcome distribution changed in response to some population-level intervention? How can we contrast quantiles of the factual and counterfactual distributions?
Borah and Basu (2013) suggest that CQR can quantify these contrasts under some stringent assumptions: 1) the outcome is only a function of the exposure or 2) the exposure only induces a location shift in the outcome in the presence of other covariates. If an exposure interacts with other covariates in the data generating process – which is quite likely given how complex our world is – then CQR cannot estimate contrasts of the unconditional outcome distribution.
Several methods have been developed to estimate contrasts between the factual and counterfactual unconditional outcome distributions. For this workshop, we will focus on a method developed by Sergio Firpo, Nicole Fortin, and Thomas Lemieux (2009) (2-3). We focus on this method because it is a regression-based method (and so compares nicely with CQR and linear regression), is relatively easy to implement, and is quite popular in the economics literature.
Firpo’s unconditional quantile regression method quantifies the change in the \(\tau^{th}\) quantile of the unconditional outcome distribution changes for a small change in the exposure distribution.
At the heart of Firpo’s method is the recentered influence function (RIF). The RIF is the sum of a distributional statistic’s influence function and the distributional statistic itself. Let’s break this down a bit.
Distributional statistics can be anything from means to quantiles to Gini coefficients. Influence functions are a way of assessing the “robustness” of distributional statistics; they allow us to compute by how much, say, the mean of a distribution would change if we made a “small” change or addition to the existing distribution. Each distributional statistic has an influence function specific to it, and the influence function for the \(\tau^{th}\) quantile of Y is \(IF(y; q_{\tau}) = \frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}\), where \(q_{\tau}\) is the value of a random variable \(Y\) at the \(\tau^{th}\) quantile, \(I(y \leq q_{\tau})\) is an indicator function for an individual falling above or below \(q_{\tau}\), and \(f_y(q_{\tau})\) is the density of Y at the \(\tau^{th}\) quantile. Firpo then defines the RIF for the \(\tau^{th}\) quantile as \(RIF(y;q_{\tau}) = q_{\tau} + IF(y; q_{\tau}) = q_{\tau} + \frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}\). A nice property of the RIF is that its mean equals the distributional statistic of interest. Note that in the case of quantiles: \(E[RIF(y; q_{\tau})] = E[q_{\tau} + \frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}] = E[q_{\tau}] + E[\frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}] = q_\tau\) because \(E[\frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})}]=0\) (since the numerator of this equation is \(\tau - \tau\) in expectation).
Firpo then proposes using the RIF as the outcome in a regression model to estimate the change in the \(\tau^{th}\) quantile of the unconditional outcome distribution for a small change in the unconditional distribution of the exposure. They call this modeling strategy the RIF-regression and propose several ways of estimating the RIF-regression, one of which is OLS.
The Firpo method has three main steps:
First, we will use the package dineq
to estimate the recentered influence function (RIF) at certain quantiles
of the unconditional outcome distribution. Note that RIF values can be
found for any quantile between (0,1).
#install.packages("dineq")
library(dineq) #RIF package
#RIF at q25
data$rif_q25 <- rif(data$sbp, weights = NULL, method = "quantile",
quantile = 0.25)
data$rif_q25_r <- round(data$rif_q25, 2)
rif25 <- ggplot(aes(x = factor(rif_q25_r)), data = data) +
geom_bar(width = 0.2, fill = "#F2494C") +
labs(x = "RIF Value",
y = "Count",
title = "25th Quantile") +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
strip.background = element_blank(),
text = element_text(size = 15))
#RIF at q50
data$rif_q50 <- rif(data$sbp, weights = NULL, method = "quantile",
quantile = 0.50)
data$rif_q50_r <- round(data$rif_q50, 2)
rif50 <- ggplot(aes(x = factor(rif_q50_r)), data = data) +
geom_bar(width = 0.2, fill = "#468C8A") +
labs(x = "RIF Value",
y = "Count",
title = "50th Quantile") +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
strip.background = element_blank(),
text = element_text(size = 15))
#RIF at q90
data$rif_q75 <- rif(data$sbp, weights = NULL, method = "quantile",
quantile = 0.75)
data$rif_q75_r <- round(data$rif_q75, 2)
rif75 <- ggplot(aes(x = factor(rif_q75_r)), data = data) +
geom_bar(width = 0.2, fill = "#FF9300") +
labs(x = "RIF Value",
y = "Count",
title = "75th Quantile") +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
strip.background = element_blank(),
text = element_text(size = 15))
#Combine
rif_vals <- data %>%
ungroup() %>%
dplyr::select(id, ends_with("_r")) %>%
rename("q25" = "rif_q25_r",
"q50" = "rif_q50_r",
"q75" = "rif_q75_r") %>%
pivot_longer(!id, names_to = "Quantile", values_to = "RIF_Values")
rif_comb <- ggplot(aes(x = RIF_Values, fill = factor(Quantile)),
data = rif_vals) +
geom_bar(width = 5) +
labs(x = "RIF Value",
y = "Count",
fill = "Quantile",
title = "25th, 50th, and 75th Quantiles") +
scale_fill_manual(values = c("#F2494C", "#468C8A", "#FF9300")) +
scale_x_continuous(breaks = seq(25, 300, by = 25)) +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
strip.background = element_blank(),
text = element_text(size = 15))
grid.arrange(rif25, rif50, rif75, rif_comb, ncol = 3)
Using the 25th quantile as an example, we will show how the expected value of the recentered influence function equals the \(\tau^{th}\) quantile.
\[RIF(SBP_i;q_{0.25}) = q_{0.25} + \frac{\tau - I(SBP_i \leq q_{0.25})}{f_y(q_{0.25})} = 114 + \frac{\tau - I(SBP_i \leq 114)}{f_y(114)} = \begin{equation} \begin{cases} 75.94, I(SBP_i \leq 114) \\ 126.69, I(SBP_i > 114) \end{cases} \end{equation}\]
At the 25th quantile, the recentered influence function takes on the value 75.94 for the 25% of participants who fall below the threshold (i.e., \(q_{0.25}\) = 114mmHg) and 126.69 for the 75% of participants who fall above the 25th threshold. The weighted average of these values (i.e., the mean value of the RIF) equals 114mmHg, which is the 25th quantile of the unconditional distribution of SBP in our data.
\(E[RIF(SBP_i;q_{0.25})] = (0.25)(75.94) + (0.75)(126.69) = 114 = q_{\tau}\)
Now that we have estimated the recentered influence function in the unconditional outcome distribution, we regress the RIF on the exposure and covariates. There are three ways of regressing the RIF on the exposure and covariates:
We will focus on RIF-OLS since it is the easiest to implement and the most practical, but rearranging the original recentered influence function to isolate the \(I(y > q_\tau)\) term is used in RIF-Logit and RIF-NP, where
\(RIF(y;q_{\tau}) = q_{\tau} + \frac{\tau - I(y \leq q_{\tau})}{f_y(q_{\tau})} = \frac{1}{f_y(q_{\tau})}I(y > q_\tau) + q_\tau - \frac{1-\tau}{f_y(q_{\tau})}\)
We will go over an example of fitting the RIF-OLS regression for SBP at the 25th quantile of the unconditional distribution. Specifically, we will fit the equation:
\(E[RIF(SBP_i;q_{0.25})|X,C] = \alpha_{0, 0.25} + \alpha_{1, 0.25}x_i + \lambda_{0.25}'C\)
Where \(\lambda_{0.25}\) is a vector of coefficients for the confounders.
If we take the expectation of the expression above, we iterate the expectation on the left hand side of the equation and distribute the expectation of the right hand, leaving us with
\(E[E[RIF(SBP_i;q_{0.25})|X,C]] = E[\alpha_{0, 0.25} + \alpha_{1, 0.25}x_i + \lambda_{0.25}'C]\)
\(E[RIF(SBP_i;q_{0.25})|X,C] = \alpha_{0, 0.25} + \alpha_{1, 0.25}E[X] + \lambda_{0.25}'E[C]\)
Where the left hand side of the equation is the expectation of the RIF (i.e., \(q_{0.25}\)), \(\alpha_{0, 0.25}\) is the intercept of the sample regression line and \(\alpha_{1, 0.25}\) is the slope of the sample regression line.
Firpo’s method “tricks” a regression model into modeling quantiles of the unconditional outcome distribution by using the recentered influence function as the outcome
Coefficients of the RIF-OLS model are interpreted differently than traditional linear regression coefficients. Since \(E[\widehat{RIF(SBP_i;q_{\tau})}] = \widehat{\alpha_{0, \tau}} + \widehat{\alpha_{1, \tau}}E[X] + \widehat{\lambda_{\tau}'}E[C]\), \(\widehat{\alpha_{1, \tau}}\) is the estimated change in the \(\tau^{th}\) quantile of the unconditional distribution of \(Y\) for a unit change in the mean of the unconditional distribution of the exposure \(X\).
RIF-OLS estimates the change in the outcome’s uncondtional distribution for a unit change in the mean of the unconditional distribution of the exposure
In the context of our example, unconditional quantile regression answers the question: By how much does the \(\tau^{th}\) quantile of the unconditional SBP distribution change if the mean of the unconditional education distribution changed by one unit?
There are two approaches to estimating the standard error for RIF regressions. The first – and preferred – method is the bootstrap. When bootstrapping, you resample with replacement 500+ times, fit RIF-OLS regression in each sample, then use the 2.5th and 97.5th quantiles of the “sampling distribution” (i.e., the 500+ samples).
The second method is to estimate heteroskedasticity robust standard errors. These can be estimated the same way we estimate heteroskedastic robust standard errors for linear regression, by using \(se(\hat{\beta})=(X'X)^{-1}X' \Omega X(X'X)^{-1}\) where \(\Omega\) is the variance-covariance matrix
\[ \Omega = \left[\begin{array}{cccc} \sigma_1^2 & 0 & ... & 0 \\ 0 & \sigma_2^2 & ... & 0 \\ \vdots & 0 & \ddots & \vdots \\ 0 & ... & ... & \sigma_n^2 \end{array}\right] = \left[\begin{array}{c} \epsilon_1^2 & 0 & ... & 0 \\ 0 & \epsilon_2^2 & ... & 0 \\ \vdots & 0 & \ddots & \vdots \\ 0 & ... & ... & \epsilon_n^2 \end{array}\right] = \left[\begin{array}{c} (y_1-\hat{y_1})^2 & 0 & ... & 0 \\ 0 & (y_2-\hat{y_2})^2 & ... & 0 \\ \vdots & 0 & \ddots & \vdots \\ 0 & ... & ... & (y_n-\hat{y_n})^2 \end{array}\right] \]
We will use the dineq
package to run our unconditional quantile regressions using Firpo’s
recentered influence function method (RIF-OLS). We will only show
bootstrapped confidence intervals since it is the preferred method for
standard error adjustments.
data$rif25 <- rif(data$sbp, weights = NULL, method = "quantile",
quantile = 0.25)
uqr_25 <- lm(rif25 ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data)
#Bootstrap standard error
uqr_25_boot <- lm.boot(uqr_25, R = 50)
boot_sum_25 <- summary(uqr_25_boot)
adj_uqr_25 <- c("Estimate" = boot_sum_25$orig.lm$coefficients["schlyrs_c"],
"Bootstrapped Std. Error" = boot_sum_25$stdev.params["schlyrs_c"])
adj_uqr_25
## Estimate.schlyrs_c Bootstrapped Std. Error.schlyrs_c
## -0.4941319 0.1165962
summary(uqr_25)$coefficients["schlyrs_c", ] #Unadjusted
## Estimate Std. Error t value Pr(>|t|)
## -4.941319e-01 1.039239e-01 -4.754749e+00 2.018277e-06
Notice that our non-adjusted “schlyrs_c” standard error is slightly different from the bootstrapped “schlyrs_c” standard error.
Below, we also provide a function that loops through and fits unconditional quantile regressions at the 1st to 99th quantiles of the unconditional outcome distribution.
uqr_func <- function(data){
result <- data.frame()
for(i in seq(0.01, 0.99, by = 0.01)){
i <- round(i, 2)
data$rif <- rif(data$sbp, weights = NULL, method = "quantile",
quantile = i)
set.seed(123)
uqr_mod <- lm(rif ~ schlyrs_c + age + age2 + gender + race + southern +
mom_ed + dad_ed + year, data = data)
#Bootstrap standard error
uqr_boot <- lm.boot(uqr_mod, R = 500)
boot_sum <- summary(uqr_boot)
uqr_schlyrs_est <- as.numeric(boot_sum$orig.lm$coefficients["schlyrs_c"])
uqr_schlyrs_sd <- as.numeric(boot_sum$stdev.params["schlyrs_c"])
uqr_row <- data.frame("Quantile" = i,
"Estimate" = uqr_schlyrs_est,
"Lower" = uqr_schlyrs_est - (1.96*uqr_schlyrs_sd),
"Upper" = uqr_schlyrs_est + (1.96*uqr_schlyrs_sd),
"regtype" = "UQR")
result <- rbind(result, uqr_row)
}
return(result)
}
uqr_results <- uqr_func(data)
#Figure
ggplot(data = uqr_results, aes(x = Quantile, y = Estimate)) +
geom_line(color = "#FF9300", size = 1) +
geom_ribbon(aes(x = Quantile, ymin = Lower, ymax = Upper),
alpha = 0.45, fill = "#FF9300") +
geom_hline(yintercept = 0, alpha = 0.5) +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(x = xaxis,
y = yaxis,
title = "Association of educational attainment with systolic blood pressure") +
scale_x_continuous(breaks = c(0.01, seq(0.10, 0.90, 0.10), 0.99),
labels = c("q1", "q10", "q20", "q30", "q40", "q50",
"q60", "q70", "q80", "q90", "q99"))
We can also add our OLS estimate to the UQR plot so viewers can easily compare estimates at the mean to estimates at each quantile of the unconditional distribution.
ols_uqr <- rbind(ols_results, uqr_results)
#Figure
ggplot(data = ols_uqr %>% filter(regtype != "OLS"),
aes(x = Quantile, y = Estimate, group = regtype,
color = regtype, fill = regtype)) +
geom_line(alpha = 50, linewidth = 0.75) +
geom_ribbon(aes(ymin = Lower, ymax = Upper), alpha = 0.27, color = NA) +
geom_point(data = ols_uqr %>% filter(regtype == "OLS")) +
geom_errorbar(data = ols_uqr %>% filter(regtype == "OLS"),
aes(ymin = Lower, ymax = Upper), width = 0.01) +
geom_hline(yintercept = 0, alpha = 0.5) +
geom_vline(xintercept = -0.04, color = "gray", linetype = "dashed") +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
text = element_text(size = 10)) +
scale_color_manual(breaks = c("OLS", "UQR"),
labels = c("OLS", "UQR"),
values = c("#F2494C", "#FF9300")) +
scale_fill_manual(breaks = c("OLS", "UQR"),
labels = c("OLS", "UQR"),
values = c("#F2494C", "#FF9300")) +
scale_x_continuous(limits = c(-0.11, 1),
breaks = c(-0.1, 0.01, seq(0.10, 0.90, 0.10), 0.99),
labels = c("OLS", "q1", "q10", "q20", "q30", "q40", "q50",
"q60", "q70", "q80", "q90", "q99")) +
labs(x = "",
y = yaxis,
color = "Model", group = "Model", fill = "Model") +
theme(legend.position = "bottom",
legend.box.spacing = unit(0, "pt"),
legend.margin = margin(-10,0,0,0),
text = element_text(size = 15))
Just like with CQR, we can use model estimates from UQR to create counterfactual distributions of SBP. We impose our estimates at each quantile and plot what the new counterfactual distribution could look like.
#Creates dataset for plotting
counter_data <- function(results, data){
coef <- results %>%
dplyr::filter(regtype == "UQR") %>%
dplyr::select(Quantile, Estimate) %>%
rename("quant" = "Quantile",
"coef" = "Estimate")
coef$quant <- round(coef$quant * 100, 1) #Need to round for merging
red <- data %>%
dplyr::select(sbp) %>%
mutate(quant = xtile(sbp, n = 99))
merg <- right_join(coef, red, by = "quant")
cf <- merg %>%
mutate(sbp = sbp + coef,
group = "Counterfactual") %>%
select(group, sbp)
f <- merg %>%
mutate(group = "Factual") %>%
select(group, sbp)
cf_plot <- rbind(f, cf)
cf_plot$group <- relevel(factor(cf_plot$group), ref = "Factual")
return(cf_plot)
}
#Create counterfactual data
uqr_counter <- counter_data(uqr_results, data)
#Create graphic
ggplot(data = uqr_counter, aes(x = sbp, color = group, linetype = group)) +
geom_density(aes(y = after_stat(density)), linewidth = 1) +
theme(panel.background = element_rect(fill = 'white', colour = 'white'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom",
strip.background = element_blank(),
text = element_text(size = 20)) +
labs(color = "Distribution",
linetype = "Distribution",
x = "SBP (mmHg)",
y = "Density") +
scale_color_manual(values = c("Factual" = "#468C8A",
"Counterfactual" = "#FF9300")) +
scale_linetype_manual(values = c("Factual" = "solid",
"Counterfactual" = "dashed"))
We provide a quick table to compare the differences between linear regression, conditional quantile regression, and unconditional quantile regression.
| Linear Regression | Conditional Quantile Regression | Unconditional Quantile Regression | |
|---|---|---|---|
| Model | \(E[Y|X] = X'\beta\) | \(Q_\tau(Y|X)=X'\beta_\tau\) where \(\tau = (0,1)\) | \(E[RIF(Y;q_\tau)|X]=X'\beta_\tau\) where \(\tau = (0,1)\) |
| Error | Standard assumptions about the error term (iid, normal distribution) are usually violated | No distributional assumption is made about the error | No explicit assumption, but likely the same assumptions as the estimation strategy used |
| Estimating coefficients | \(\hat{\beta}=(X'X)^{-1}X'Y\) (analytic solution) | \(min_\hat{\beta}\sum_{i=1}^{N} \rho_\tau(y_i - x_i^{'}\hat{\beta})\) (use linear programming methods) | Method of estimation depends on the estimator used (e.g., RIF-OLS) |
| Estimating standard errors | Several estimators are available + bootstrap | Several estimators are available and key point is that we need to estimate the error density + bootstrap | Same estimators available as would be for choice of estimator |
| Interpretation | One unit increase in the independent variable of interest is associated with a \(\hat{\beta}\) unit change in the conditional mean of the outcome | One unit increase in the independent variable of interest is associated with a \(\hat{\beta}\) unit change in the \(\tau^{th}\) quantile of the conditional outcome distribution | One unit increase in the mean of the independent variable of interest is associated with a \(\hat{\beta}\) unit change in the \(\tau^{th}\) quantile of the unconditional outcome distribution |
It’s important to note that confidence intervals can get quite wide at the tails of the distribution, due to low density. Because of this, some researchers present only the 10th through the 90th quantiles, but we prefer to present results from the 1st through the 99th quantiles. Finally, remember that
Even if CQR and UQR estimates appear similar, they are estimating different estimands and connot be directly compared!
We’d love to hear feedback or answer any questions you still have! Please feel free to reach out to us through email at any time.
Jilly Hebert jilly.hebert@ucsf.edu
Department of Family and Community Medicine, University of California,
San Francisco
Amanda Irish amanda.irish@ucsf.edu
Department of Family and Community Medicine, University of California,
San Francisco
Anusha Vable anusha.vable@ucsf.edu
Department of Family and Community Medicine, University of California,
San Francisco
Chernozhukov V, Fernández-Val I, Melly B. Inference on Counterfactual Distributions. Econometrica. 2013;81(6):2205–68.
Firpo S, Fortin NM, Lemieux T. Unconditional Quantile Regressions. Econometrica. 2009;77(3):953–73.
Firpo S, Pinto C. Identification and Estimation of Distributional Impacts of Interventions Using Changes in Inequality Measures. J Appl Econom. 2016;31(3):457–86.
Koenker R, Bassett G. Regression Quantiles. Econometrica. 1978;46(1):33–50.
Koenker R, Chernozhukov V, He X, Peng L, editors. Handbook of Quantile Regression [Internet]. 1st Edition. New York: Chapman & Hall/CRC; 2017 [cited 2022 Sep 7]. 483 p. Available from: https://www.routledge.com/Handbook-of-Quantile-Regression/Koenker-Chernozhukov-He-Peng/p/book/9780367657574
Koenker R. Quantile Regression: 40 Years On. Annu Rev Econ. 2017;9(1):155–76.
Vable AM, Eng CW, Mayeda ER, Basu S, Marden JR, Hamad R, et al. Mother’s education and late-life disparities in memory and dementia risk among US military veterans and non-veterans. J Epidemiol Community Health. 2018 Dec;72(12):1162–7.
Wu Q, Tchetgen Tchetgen EJ, Osypuk TL, White K, Mujahid M, Maria Glymour M. Combining direct and proxy assessments to reduce attrition bias in a longitudinal study. Alzheimer Dis Assoc Disord. 2013 Sep;27(3):207–12.